Inter-package dependency networks in 
open-source software 



Nathan LaBelle, Eugene Wallingford

Computer Science Department, University of Northern Iowa, Cedar Falls, Iowa
50613

Abstract

This research analyzes complex networks in open-source software at the
inter-package level, where package dependencies often span across projects and
between development groups. We review complex networks identified at "lower"
levels of abstraction, and then formulate a description of interacting software
components at the package level, a relatively "high" level of abstraction. By
mining open-source software repositories from two sources, we empirically show
that the coupling of modules at this granularity creates a small-world and
scale-free network in both instances.

1 Introduction and Previous Research



In recent years the identification and categorization of networks has become 
an emerging research area in fields as diverse as sociology and biology, but 
has remained relatively unutilized in software engineering. The study and
categorization of software systems as networks is a promising field, as the
identification of networks in software systems may prove to be a valuable tool 
in managing the complexity and dynamics of software growth, which have 
traditionally been problems in software engineering. However, current trends 
in software development offer diverse and accessible software to study, which 
may help software engineers learn how to create better programs. In partic- 
ular, open-source software (OSS) allows researchers access to a rich set of 
examples that are production-quality and can be studied "in the wild". They are
a valuable asset that can aid in the study of software development and in
managing complexity.

In OSS systems, applications are often distributed in the form of packages. 
A package is a bundle of related components necessary to compile or run an 
application. Because resource reuse is naturally a pillar of OSS, a package is 
often dependent on some other packages to function properly. These packages 
may be third-party libraries, bundles of resources such as images, or Unix 
utilities such as grep and sed. Package dependencies often span across project 
development teams, and since there is no central control over which resources 
from other packages are needed, the software system self-organizes into a
collection of discrete, interconnected components. This research applies
complex network theory to package dependency networks mined from two OSS
repositories. 

A network is a large (typically unweighted and simple) graph G = (V, E) where 
V denotes a vertex set and E an edge set. Vertices represent discrete objects 
in a dynamical system, such as social actors, economic agents, computer pro- 
grams, or biological producers and consumers. Edges represent interactions 
among these "interactons". For example, if software objects are represented
as vertices, edges can be assembled between them by defining some meaningful
interaction between the objects, such as inheritance or procedure calls
(depending on the nature of the programming language used).

Real-world networks tend to share a common set of non-trivial properties:
they have scale-free degree distributions and exhibit the small-world effect.
The degree of a vertex $v$, denoted $k$, is the number of vertices adjacent to
$v$ or, in the case of a digraph, either the number of incoming edges or the
number of outgoing edges, denoted $k_{in}$ and $k_{out}$, respectively. In
real-world networks such as the Internet [6], the World-Wide Web [1], software
objects [10,13,15,16], and networks of scientific citations [8,14], the
distribution of edges roughly follows a power law: $P(k) \propto k^{-\alpha}$.
That is, the probability of a vertex having $k$ edges decays as a power of $k$
with some constant exponent $\alpha \in \mathbb{R}^+$. This is significant
because it shows deviation from randomly constructed graphs, first studied by
Erdős and Rényi, whose degree distributions were proven to take on a Poisson
distribution in the limit of large $n$, where $n = |V|$ [2].
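
A brief derivation, added here as an illustration, shows why a power law
appears as a straight line on doubly logarithmic axes while a Poisson
distribution does not. The exponent is written $\alpha$; the normalizing
constant $c$ and the mean degree $\lambda$ are symbols introduced for this
sketch and do not appear in the original text.

```latex
\begin{align*}
\text{Power law:}\quad P(k) &= c\,k^{-\alpha}
  &&\Longrightarrow\quad \log P(k) = \log c - \alpha \log k \\
\text{Poisson:}\quad P(k) &= \frac{e^{-\lambda}\lambda^{k}}{k!}
  &&\Longrightarrow\quad \log P(k) = -\lambda + k \log \lambda - \log k!
\end{align*}
```

In the first case $\log P(k)$ is linear in $\log k$ with slope $-\alpha$; in
the second, the $\log k!$ term makes the tail fall off faster than any power
of $k$.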

Random connection models also fail to explain the "small-world effect" in real
networks, the canonical examples being social collaboration networks [11,12],
certain neural networks [17], and the World-Wide Web [1]. The small-world
effect states that $C_{random} \ll C_{sw}$ and $L_{random} \approx L_{sw}$,
where $C$ is the clustering coefficient of a graph and $L$ is the
characteristic path length [17]. The clustering coefficient is the propensity
for neighbors $u, w \in V$ of a vertex $v$ to be connected to each other. For
a vertex $v$ with $k_v$ neighbors, we can define the clustering coefficient as
$C_v = \frac{2 e_v}{k_v (k_v - 1)}$, where $e_v$ is the number of edges among
the neighbors of $v$, and therefore $C_v \in [0, 1]$. The clustering
coefficient for a graph is the average over all vertices,
$C = \frac{1}{n} \sum_{v \in V} C_v$. Real-world networks are normally highly
clustered while random networks are not, because $C_{random} \approx k/n$ for
large networks [2], where $k$ is the average degree. Because most networks are
sparse, that is $n \gg k$, random networks are not highly clustered. $L$ is the
average geodesic (unweighted) distance between vertices.
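
The two statistics defined above can be computed directly from an adjacency
structure. The following Python sketch is illustrative only: it assumes an
undirected simple graph stored as a dictionary of neighbor sets, takes the
per-vertex coefficient to be 0 for vertices with fewer than two neighbors, and
averages path lengths over connected ordered pairs. These conventions, the
function names, and the toy graph are assumptions made for the example, not
details taken from the study.

```python
from collections import deque


def clustering_coefficient(adj):
    """Average of the per-vertex clustering coefficients C_v."""
    total = 0.0
    for v, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            continue  # assumption: C_v = 0 when v has fewer than two neighbors
        nbrs = list(neighbors)
        # Count edges that exist among the neighbors of v.
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def characteristic_path_length(adj):
    """Mean geodesic distance over all connected ordered pairs of vertices."""
    dist_sum, pairs = 0, 0
    for source in adj:
        # Breadth-first search gives unweighted shortest-path distances.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        dist_sum += sum(dist.values())
        pairs += len(dist) - 1
    return dist_sum / pairs if pairs else 0.0


# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
C = clustering_coefficient(toy)
L = characteristic_path_length(toy)
```

For the toy graph, vertices a and b each have a coefficient of 1, c has 1/3,
and d has 0, so $C = 7/12$; the twelve ordered pairs have distances summing to
16, so $L = 4/3$.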

To summarize, random graphs are not small-world because they are not highly
clustered (although they have short path lengths), and they do not follow
the commonly observed power law because their edge distribution is Poissonian.
The presence of these features in networks indicates non-random creation
mechanisms; several models have been proposed, but none is agreed upon. In
order to make accurate hypotheses about possible network creation mechanisms,
a wide variety of real-world networks sharing these non-trivial properties
should be identified.

Previous research on networks in software has focused on software at "low"
levels of abstraction (relative to the current research). Clark and Green [4]
found Zipf distributions (a ranking distribution similar to the power law, which
is also found in word frequencies in natural language [18]) in the structure
of CDR and CAR lists in large Lisp programs during run-time. Several studies
[10,13,15,16] have identified the small-world effect and power-law edge
distributions in networks of objects or procedures. In the case of
object-oriented programming languages, edges represent meaningful
interconnections between objects, such as inheritance; in the case of
procedural languages, procedures are represented as vertices and edges
symbolize function calls. Similar statistical features have also been
identified in networks where the vertices represent source code files on a
disk and edges represent a dependency between files (for example, in C and C++
one source file may #include another) [9], and in documentation systems [16].



2 Package Dependency Networks 

Mining the Debian GNU/Linux software repository [5] and the FreeBSD Ports 
Collection [3] has allowed us to create networks of package dependencies. In 
the case of the Debian repository, data was taken from the i386 branch of 
the "unstable" tree, which contains the most up-to-date software and is the 
largest branch. The Debian data was extracted using apt (Advanced Packaging 
Tool), while the BSD data was extracted from the ports INDEX system. The 
BSD Ports system allowed us to distinguish between run-time dependencies 
and compile-time (build) dependencies. The BSD data reported here covers only
compile-time dependencies, although results are similar for run-time
dependencies. Graphs
were constructed in Java using the Java Universal Network/Graph framework 
[7]. "Snapshots" of the repositories were taken during the month of September
2004.
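
To make the graph construction concrete, the sketch below builds a small
dependency digraph from text in the stanza style of Debian package metadata
("Package:" and "Depends:" fields). It is a simplified illustration in Python,
not the authors' pipeline: the text does not say how the apt output was
processed, the sample packages are hypothetical, and keeping only the first
alternative of an "a | b" clause is an assumption made for the example.

```python
import re

# Hypothetical sample data in control-file stanza style.
SAMPLE = """\
Package: editor
Depends: libc6 (>= 2.3), libgui | libtext

Package: libgui
Depends: libc6

Package: libtext
Depends: libc6

Package: libc6
"""


def parse_dependencies(text):
    """Return {package: set of packages it depends on} from stanza text."""
    graph = {}
    for stanza in text.strip().split("\n\n"):
        name, deps = None, set()
        for line in stanza.splitlines():
            field, _, value = line.partition(":")
            if field == "Package":
                name = value.strip()
            elif field == "Depends":
                for clause in value.split(","):
                    # Assumption: keep only the first alternative of "a | b".
                    first = clause.split("|")[0]
                    # Drop version constraints such as "(>= 2.3)".
                    dep = re.sub(r"\(.*?\)", "", first).strip()
                    if dep:
                        deps.add(dep)
        if name is not None:
            graph[name] = deps
    return graph


def degrees(graph):
    """Return (k_in, k_out) dictionaries for a dependency digraph.

    An edge points from a package to each package it depends on, so k_out
    counts dependencies and k_in counts dependents.
    """
    k_out = {pkg: len(deps) for pkg, deps in graph.items()}
    k_in = {pkg: 0 for pkg in graph}
    for deps in graph.values():
        for dep in deps:
            k_in[dep] = k_in.get(dep, 0) + 1
    return k_in, k_out


graph = parse_dependencies(SAMPLE)
k_in, k_out = degrees(graph)
```

Real package indexes contain additional fields and multi-line values that this
sketch does not handle.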



The Debian network contains $n = 19{,}504$ packages and $m = 73{,}960$ edges,
giving each package an average coupling to 3.79 packages. For the Debian
network, $C = 0.52$ and $L = 3.34$. This puts the Debian network in the
small-world range, since an equivalent random graph would have
$C_{random} \approx 0.0019$ and $L_{random} \approx 7.41$. There are 1,945
components, but the largest component contains 88% of the vertices. The rest
of the vertices are disjoint from each other, resulting in a large number of
components with only 1 vertex. The diameter of the largest component is 31.
The distribution of outgoing edges, which is a measure of dependency on other
packages, follows a power law with $\alpha_{out} \approx 2.33$. The
distribution of incoming edges, which measures how many packages are dependent
on a package, follows a power law with $\alpha_{in} \approx 0.90$. While 10,142
packages are not referenced by any package at all, the most highly referenced
packages are referenced thousands of times. 73% of packages depend on some
other package to function correctly. Correlation between $k_{in}$, $k_{out}$,
and package size is not calculated because the normality assumption is
violated.

The BSD compile-time dependency network contains $n = 10{,}222$ packages and
$m = 74{,}318$ edges, coupling each package to an average of $k = 7.27$ other
packages. For the BSD network, $C \approx 0.56$ and $L \approx 2.86$. An
equivalent random graph would have $C_{random} \approx 0.007$ and
$L_{random} \approx 7.11$. Hence, the BSD network is small-world. The degree
distribution of the BSD network also resembles a power law, with
$\alpha_{in} \approx 0.62$ and $\alpha_{out} \approx 1.28$. For the run-time
network, results were similar: the run-time network is both small-world and
follows a power law.
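
The average coupling figures above follow directly from the reported vertex
and edge counts, and component sizes of the kind reported for the Debian
network can be obtained with a union-find pass over the edge list. The Python
sketch below illustrates both. It assumes that components are weakly connected
(edge direction ignored), which the text does not state explicitly; the
function name and the toy digraph are hypothetical.

```python
def weak_components(n_vertices, edges):
    """Sizes of weakly connected components, largest first (union-find)."""
    parent = list(range(n_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components

    sizes = {}
    for x in range(n_vertices):
        root = find(x)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


# Average coupling m / n from the counts reported in the text.
debian_mean = 73960 / 19504
bsd_mean = 74318 / 10222

# Hypothetical toy digraph: 0->1, 1->2, 3->2, plus isolated vertices 4 and 5.
sizes = weak_components(6, [(0, 1), (1, 2), (3, 2)])
giant_fraction = sizes[0] / 6
```

Rounded to two decimals, the two means reproduce the reported 3.79 and 7.27.
In the toy digraph the largest component holds four of the six vertices and
the remaining two are singleton components.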

In the Debian network, the 20 most highly depended-upon packages are libc6
(7861), xlibs (2236), libgcc1 (1760), zlib1g (1701), libx11-6 (1446), perl (1356),
libxext6 (1110), debconf (1013), libice6 (922), libsm6 (919), libglib2.0-0 (859),
libpng12-0 (622), libncurses5 (616), libgtk2.0-0 (615), libpango1.0-0 (610),
libatk1.0-0 (602), libglib1.2 (545), libxml2 (538), libart-2.0-2 (524), and
libgtk1.2 (474). The number in parentheses represents the number of incoming
edges. The list is composed mainly of libraries that provide some functionality
to programs, such as XML parsing, or that provide reusable components, such as
graphical interface widgets. Because the most highly connected package (libc6)
is required for execution of C and C++ programs, we can infer that these are
the most widely used programming languages.

Fig. 1. Log-log scatterplot of $k_{in}$ and $k_{out}$ (respectively) for the Debian network

Figure 1 shows the double-log distribution of edges in the Debian network 
(scatterplots for the BSD network would have a similar shape). From the 
figure we can see the heavy-tailed power-law shape. The absolute value of the
slope of the regression line indicates the power-law exponent, $\alpha$.
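
The slope-based estimate described above can be sketched in a few lines of
Python. The example fits an ordinary least-squares line to the base-10
logarithm of the number of vertices with degree k against the logarithm of k
and returns the negated slope. The text does not specify the binning or the
fitting procedure that was actually used, so the unbinned counts, the function
name, and the sample data are assumptions made for illustration.

```python
import math
from collections import Counter


def powerlaw_exponent(degrees):
    """Estimate alpha as minus the OLS slope of log10(count) against log10(k)."""
    counts = Counter(k for k in degrees if k > 0)  # log is undefined at k = 0
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Hypothetical degree sequence in which the number of vertices with degree k
# is exactly 1000 / k**2 for k = 1, 2, 5, 10, so the fitted exponent is 2.
sample = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
alpha = powerlaw_exponent(sample)
```

Vertices of degree zero are excluded before fitting because they cannot be
placed on logarithmic axes.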



3 Conclusion and Discussion 

This research has shown that package dependency networks mined from two 
open-source software repositories share the following properties typical of
other real-world networks: 

• The small-world effect: short geodesic path lengths and high clustering.

• Near power-law distribution of edges. 

• The presence of a giant component, $|V \in G_1| \gg |V \in G_2|$.

There are many directions for future research in the study of software
networks. Currently, there is no model of network formation that takes software
dynamics (reuse, refactoring, addition of new packages) into account. Also,
the impact of the network structure on software dynamics should be
investigated. Future research should identify other networks in software and
move towards formulating a theory of networks and their value to software
engineering. Additional dependency networks can be constructed on Windows
computers by using memory profiling tools and determining interactions based
on shared .DLL (Dynamic Link Library) files and ActiveX controls.